Anomaly detection in text

ABSTRACT

Systems and techniques are generally described for anomaly detection in text. In some examples, text data comprising a plurality of words may be received. An image of a first word of the plurality of words may be generated. A feature representation of the first word may be generated using a variational autoencoder. A score may be generated based at least in part on the feature representation. In various examples, the score may indicate a likelihood that an appearance of the first word in the image of the first word is anomalous with respect to at least some other words of the plurality of words.

BACKGROUND

The font of a body of printed text may be altered within the body of text for various purposes. For example, italics, bolding, underlining, changing of typeface, etc., may be used to provide emphasis and/or to draw the reader's attention to particular parts of the text. In an example, a label of a food product may have allergens (peanuts, milk, soy, wheat, etc.) depicted in bold among other non-bolded ingredients in order to draw the reader's attention to such terms. Similarly, the description of an item for sale in an online catalog may also use different fonts to highlight certain ingredients or aspects of the item.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1A is a block diagram depicting an example system for automatically validating food labels, in accordance with various aspects of the present disclosure.

FIG. 1B is a block diagram depicting an example system effective to perform anomaly detection in text, according to various embodiments of the present disclosure.

FIG. 2 depicts use of statistical techniques for anomaly detection in text, in accordance with various aspects of the present disclosure.

FIG. 3 depicts an example of training of an example system effective to perform anomaly detection in text, in accordance with various aspects of the present disclosure.

FIG. 4 depicts an example process that may be used to perform anomaly detection in text, in accordance with various aspects of the present disclosure.

FIG. 5 is a block diagram showing an example architecture of a computing device that may be used in accordance with various aspects of the present disclosure.

FIG. 6 is a diagram illustrating an example system for sending and providing data that may be used in accordance with the present disclosure.

FIG. 7 is a block diagram illustrating another example system effective to perform anomaly detection in text, according to various aspects of the present disclosure.

DETAILED DESCRIPTION

In the following description, reference is made to the accompanying drawings that illustrate several examples of the technology described herein. It is understood that other examples may be utilized and various operational changes may be made without departing from the scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of the embodiments described herein is defined only by the claims of the issued patent.

E-commerce services may offer products from a variety of different third-party sellers. In some examples, the products may be required to conform to various regulatory requirements. For example, food products and/or other ingestible products sold within the United States (and/or within other nations) may be required by regulatory bodies to display an ingredient list. Further, in some examples, ingredients that are known allergens (e.g., ingredients identified by the regulatory body as allergens) may be required to be bolded and/or otherwise emphasized within the ingredient list. For example, a soup containing water, carrots, onions, milk, and garlic may be required to bold the term “milk” or otherwise emphasize the term “milk,” as milk is identified by the United States Food and Drug Administration (FDA) as an allergen.

Although many examples described herein are related to detecting bolded or otherwise emphasized words in food products and/or ingestible products, the systems and techniques described herein may be used to detect anomalous text in any context. However, for ease of illustration, the examples provided herein typically use the example of ingredient and/or component lists.

The systems and techniques described herein are configured to automatically detect words having anomalous appearance within a body of input text. “Anomalous,” in this context, refers to words depicted in a font that differs from the font of at least some other words within the body of input text. The font may differ in a variety of ways, such as, e.g., in size, weight, style, and typeface. For example, in the following list: “Water, Carrots, Onions, Red Lentils (4.5%) Potatoes, Cauliflower, Leeks, Peas, Cornflower, Wheat, Cream (milk), Peanuts, Sunflower Oil” the terms “Wheat,” “milk,” and “Peanuts” may be anomalous as these terms are either bolded (in the case of “Wheat” and “milk”) or are both bolded and italicized (in the case of “Peanuts”). Bolded and/or italicized words are uncommon within the input list, as the majority of the words are neither bolded nor italicized. Thus, these words may be anomalous with respect to at least some of the other words within the text.

Various techniques described herein are effective to determine anomalous text in terms of the appearance of the text (e.g., the appearance of the font) relative to the rest of the input text. The “anomaly” described herein does not refer to the semantic meaning of text with respect to other text, but rather the graphical depiction of the text. Although detecting such anomalous text is an easy task for human visual acuity, development of an automated system that can detect anomalous text is non-trivial. For one, heuristic systems may fail as the relative appearance of text may vary from one corpus of input text to another. Thus, detecting bolded words (or other anomalous words) may not be keyed on the size and/or thickness of the font, since different fonts use different thicknesses. Further, many machine learning language models may be prone to overfitting on training data and may learn features that semantically represent the text rather than focusing on the appearance of the text. Further, object detection methodologies may have similar issues as the heuristic-based approaches mentioned above, where the object detector may base decision making on the size and/or thickness of the font and/or otherwise may have issues with overfitting on the training data.

Thus, use of rule-based systems and/or traditional “off-the-shelf” machine learning models may be insufficient for the task of anomaly detection in text. Accordingly, in various examples described herein, specialized machine learning-based approaches may be employed for anomaly detection in text. For example, the various systems described herein may leverage Deep Neural Networks, regression models, etc.

Machine learning techniques, such as those described herein, are often used to form predictions, solve problems, recognize objects in image data for classification, similarity determination, etc. For example, herein machine learning techniques may be used to determine similarities between images associated with items and images associated with brands. Further, herein machine learning techniques may be used to determine similarity between text associated with items and text associated with brands. In various examples, machine learning models may perform better than rule-based systems and may be more adaptable as machine learning models may be improved over time by retraining the models as more and more data becomes available. Accordingly, machine learning techniques are often adaptive to changing conditions. Deep learning algorithms, such as neural networks, are often used to detect patterns in data and/or perform tasks.

Generally, in machine learned models, such as neural networks, parameters control activations in neurons (or nodes) within layers of the machine learned models. The weighted sum of activations of each neuron in a preceding layer may be input to an activation function (e.g., a sigmoid function, a rectified linear unit (ReLU) function, etc.). The result determines the activation of a neuron in a subsequent layer. In addition, a bias value can be used to shift the output of the activation function to the left or right on the x-axis and thus may bias a neuron toward activation.
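The activation computation described above can be illustrated with a minimal Python sketch. The function names (`sigmoid`, `relu`, `neuron_activation`) and the numeric values are hypothetical and chosen only for illustration; they are not taken from the described system.

```python
import math

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Rectified linear unit: passes positive values, zeroes out negatives.
    return max(0.0, z)

def neuron_activation(inputs, weights, bias, activation=sigmoid):
    # Weighted sum of the preceding layer's activations, shifted by the bias,
    # then passed through the activation function.
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (assumed, not from the disclosure).
prev_layer = [1.0, 2.0]
weights = [0.5, -0.25]
print(neuron_activation(prev_layer, weights, bias=0.0))                   # sigmoid
print(neuron_activation(prev_layer, weights, bias=1.0, activation=relu))  # ReLU
```

In this sketch, a larger bias pushes the weighted sum upward before the activation function is applied, which is one way of biasing a neuron toward activation.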

Generally, in machine learning models, such as neural networks, after initialization, annotated training data may be used to generate a cost or “loss” function that describes the difference between expected output of the machine learning model and actual output. The parameters (e.g., weights and/or biases) of the machine learning model may be updated to minimize (or maximize) the cost. For example, the machine learning model may use a gradient descent (or ascent) algorithm to incrementally adjust the weights to cause the most rapid decrease (or increase) to the output of the loss function. The method of updating the parameters of the machine learning model is often referred to as back propagation. Generally, in machine learning, an embedding is a mapping of a discrete, categorical variable to a vector of continuous numbers.
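As a hedged illustration of incremental parameter updates, the following Python sketch applies gradient descent to a one-weight, one-bias linear model with a mean squared error loss. The toy data, learning rate, and function names are assumptions made for the example; a deep network would compute the same kind of gradients layer by layer via back propagation.

```python
def mse_loss(w, b, data):
    # Mean squared difference between predicted and expected outputs.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_step(w, b, data, lr=0.1):
    # Analytic gradients of the MSE loss with respect to w and b.
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    # Step against the gradient to decrease the loss.
    return w - lr * grad_w, b - lr * grad_b

# Toy annotated data generated from y = 2x + 1 (assumed for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = 0.0, 0.0
initial_loss = mse_loss(w, b, data)
for _ in range(500):
    w, b = gradient_step(w, b, data)
final_loss = mse_loss(w, b, data)
print(w, b, initial_loss, final_loss)
```

After the loop, `w` and `b` approach the values used to generate the toy data, and the loss is far smaller than at initialization.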

In various examples, the systems and techniques described herein may take an image that includes text (e.g., an image of a food product label) as input. An optical character recognition (OCR) technique is then used to separate the image of the text into a set of images of each word of the text (e.g., each image of the set is an image of an individual word from the text). A multi-task learning approach is used that comprises 1) a variational autoencoder (VAE) used to generate a latent representation of the text attributes, and 2) a classifier (e.g., a sigmoid classifier or other classifier network) for attribute classification (e.g., bold vs. non-bold, italicized vs. non-italicized, etc.) on the word-level images. The system is trained end-to-end for multi-task learning using a decoder network to minimize both classification loss and reconstruction loss. As described herein, the VAE may include some skip connections to the decoder in order to directly carry some information from the encoder to the decoder. The multi-task learning approach and the skip connections aid the VAE in learning embeddings that represent the visual appearance of the font, rather than semantic information, in order to reduce over-fitting and improve performance.

Further, for each of the word-level images, a score is output by the classifier (e.g., between 0 and 1). Accordingly, for the input body of text (e.g., from a product label), a set S of such scores is generated. Simple thresholding can be used to determine anomalous text, or a clustering strategy may be used (e.g., z-score) in order to detect outliers.

FIG. 1A is a block diagram depicting an example system for automatically validating food labels, in accordance with various aspects of the present disclosure. In various examples, an image of an ingredient list 160 (and/or other text data) may be input into optical character recognition (OCR) 130. OCR 130 may be used to identify the individual words in the body of text represented by the image of the ingredient list 160. Accordingly, individual images of words 132 may be generated. Words may be identified by the OCR 130 as groupings of characters separated from other groupings of characters by one or more spaces. For example, the OCR 130 may generate bounding box information that specifies the location of individual words within the ingredient list 160. The bounding box information may be used to generate images of individual words (e.g., images of words 132) from the image of the ingredient list 160.
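The use of bounding box information to produce word-level images can be sketched as follows. This is a minimal illustration that assumes the image is a list of pixel rows and that each bounding box is a `(left, top, right, bottom)` tuple with exclusive right and bottom edges; the actual output format of any particular OCR engine may differ, and `crop_word_images` is a hypothetical helper.

```python
def crop_word_images(image, boxes):
    # image: list of rows, each row a list of pixel values.
    # boxes: iterable of (left, top, right, bottom); right/bottom are exclusive.
    crops = []
    for left, top, right, bottom in boxes:
        # Keep only the rows and columns that fall inside the bounding box.
        crops.append([row[left:right] for row in image[top:bottom]])
    return crops

# Toy 2x7 "image" containing two words separated by a blank column.
image = [
    [1, 1, 1, 0, 2, 2, 2],
    [1, 1, 1, 0, 2, 2, 2],
]
boxes = [(0, 0, 3, 2), (4, 0, 7, 2)]  # assumed OCR output
word_images = crop_word_images(image, boxes)
print(word_images)
```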

In various examples described herein, a system 100 (FIG. 1B) may be used to automatically determine bolded and/or otherwise anomalous words from an image that represents a plurality of words. In various examples, the system 100 may include an encoder 134 and/or a classifier 140. These components are described in further detail below. In various examples, the encoder 134 and/or classifier 140 may be used to determine whether the input images of words 132 are bolded (action 174) or are otherwise anomalous with respect to the appearance of at least some other words in the ingredient list 160.

Additionally, an allergen list 170 may comprise a list of words (e.g., text data) that are required by an applicable law or regulation to be bolded when listed as part of an ingredient list for food. Accordingly, for a given word of the ingredient list, a determination may be made (action 172) whether the given word is on the allergen list 170 and therefore should be bolded when listed on an ingredient label (e.g., ingredient list 160). For example, text data representing the image of the ingredient list 160 may be generated using OCR 130. The individual words of the text data may be compared to the allergen list 170 to determine whether any of the words are known allergens.

At comparison step 176, a determination of whether an input word-level image (e.g., “Peanuts”) is bolded may be compared to the determination (at action 172) of whether the word should be bolded due to the word being on allergen list 170. Accordingly, at action 178, a determination may be made whether, for the input ingredient list 160, known allergens that are on allergen list 170 are bolded in the ingredient list. If all allergens are detected as being bolded (or otherwise emphasized as required by the applicable law(s)), a determination may be made at action 180 that the listing of a product that is associated with the ingredient list 160 (e.g., a product labeled with the ingredient list) may be allowed to be listed on an e-commerce site. Otherwise, if one or more allergens is not bolded, contravening an applicable law or regulation, the listing of the product may be denied, and/or an alert may be generated (at action 182) to inform the seller that one or more allergens is un-bolded and thereby violates an applicable food safety law or regulation. The automatic determination of whether an individual image of a word appears bolded or is otherwise anomalous with respect to other words in the input image of text is described in further detail below.
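The comparison logic of actions 172 through 182 can be sketched in a few lines of Python. The function name `validate_label`, the punctuation handling, and the `"allow"`/`"alert"` return values are hypothetical; the sketch assumes the bold/non-bold decision for each word has already been produced by the word-level classifier.

```python
def validate_label(words, bold_flags, allergen_list):
    # Normalize the allergen list for case-insensitive matching.
    allergens = {a.lower() for a in allergen_list}
    unbolded = []
    for word, is_bold in zip(words, bold_flags):
        # Strip surrounding punctuation left over from the ingredient list.
        token = word.strip(".,;:()").lower()
        # An allergen that is not bolded contravenes the requirement.
        if token in allergens and not is_bold:
            unbolded.append(word)
    if unbolded:
        return "alert", unbolded
    return "allow", []

words = ["Water,", "Milk,", "Peanuts"]
allergen_list = ["milk", "peanuts"]  # assumed example entries
print(validate_label(words, [False, True, False], allergen_list))
print(validate_label(words, [False, True, True], allergen_list))
```

The first call reports the un-bolded allergen so that an alert can be generated; the second call allows the listing because every allergen is bolded.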

FIG. 1B is a block diagram depicting an example system 100 effective to perform anomaly detection in text, according to various embodiments of the present disclosure. In various examples, one or more computing devices 120 may be used to implement the system 100 and/or the techniques described herein. Computing devices 120 may communicate with one another and/or with one or more of the other components depicted in FIG. 1B over a network 104 (e.g., a local area network (LAN) and/or a wide area network (WAN) such as the internet). For example, computing devices 120 may communicate with a non-transitory computer-readable memory 103 over network 104. In various examples, the non-transitory computer-readable memory 103 may store instructions that, when executed by at least one processor of the computing devices 120, may cause the computing devices to perform the various anomaly detection techniques described herein.

An image of text data 128 (e.g., a product label and/or any other image of a body of text) may be input into system 100. The system 100 may use OCR 130 to identify the individual words in the body of text represented by the image of text data 128. Accordingly, individual images of words 132 may be generated. Words may be identified by the OCR 130 as groupings of characters separated from other groupings of characters by one or more spaces. Accordingly, acronyms, such as the “AAPA” depicted in FIG. 1B, may be detected as a word.

The images of words 132 may comprise two-dimensional pixel grids representing words from the image of text data 128. Each image of a word of the images of words 132 may be passed into encoder 134. Encoder 134 may be a VAE. In VAEs, encoders and decoders may be neural networks. The encoder encodes the input word image as a distribution over a latent space. The latent variable representation T is considered to have a standard Gaussian distribution P(T)=N(0, 1). The conditional distribution of the input word image X generated by the encoder 134 is also a Gaussian distribution with independent components (e.g., a diagonal covariance matrix). The conditional distribution (e.g., a feature representation of the input word image) is denoted as P(X|T). During training, the decoder 138 is a neural network that reconstructs the input image (e.g., generates reconstruction data) on the basis of P(X|T). The parameters of the encoder 134 and the decoder 138 are updated during training to minimize the reconstruction loss 150 (e.g., the quantified difference between the input image of the word and the corresponding reconstructed image of reconstructed images 146).

Generally, VAEs are trained to provide regularization of the latent space to prevent over-fitting of the model to the training data. Instead of encoding the input as a single point, the input is encoded as a distribution over the latent space. Accordingly, during training, the encoder 134 encodes the input image of a word as a distribution P(X|T) over the latent space. A point from the latent space is sampled from the distribution P(X|T). The sampled point is decoded by the decoder 138 and the reconstruction loss 150 is computed. This reconstruction loss 150 may be back-propagated through the encoder/decoder network in order to update the parameters.
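Sampling a latent point from a Gaussian with independent components can be sketched as below. The sketch assumes the encoder outputs a mean vector and a log-variance vector (a common VAE parameterization that is not specified in this disclosure) and uses the widely used reparameterization form, mean plus standard deviation times standard normal noise. The name `sample_latent` and the numeric values are hypothetical.

```python
import math
import random

def sample_latent(mean, log_var, rng=random):
    # Each latent dimension is sampled independently (diagonal covariance):
    # point = mean + std * epsilon, with epsilon drawn from N(0, 1).
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]

rng = random.Random(0)          # seeded for repeatability
mean = [0.5, -1.0]              # assumed encoder output
log_var = [-2.0, -2.0]          # assumed encoder output
point = sample_latent(mean, log_var, rng)
print(point)
```

The sampled point would then be passed to the decoder for reconstruction, while the mean vector may be passed to the classifier.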

However, system 100 may be trained using multi-task learning. In particular, the classifier 140 may be trained end-to-end with the encoder 134 and decoder 138. The classifier 140 may take the mean parameter from the conditional distribution P(X|T) as input and may generate scores 142 indicating whether the current input corresponds to a particular class. The scores 142 may be used to determine anomaly in the text appearance. For example, a simple threshold may be used (e.g., scores over 0.8 may be classified as bold text) or a statistical clustering-based method (e.g., FIG. 2, below) may be used to classify the text into output data 144 indicating whether or not the text corresponds to some anomalous class with respect to the body of text represented by the image of text data 128. Classifier loss 148 may be determined based on ground truth data (e.g., labels) of the training data (e.g., a binary label indicating bold or not bold, italicized or not italicized, etc.).

As described in further detail below in reference to FIG. 3, the encoder 134, decoder 138, and classifier 140 may be trained simultaneously by minimizing a weighted combination of reconstruction loss 150 and classifier loss 148. Skip connections 154 may persist information from layers of the encoder 134 to the decoder 138 to help prevent overfitting of the learned feature representation (e.g., P(X|T)) to the training data.

In various examples, there may be several classifiers 140 and/or classifier heads. The different classifiers 140 may each be trained to detect a different type of anomalous text. For example, a first classifier head may be trained to detect italicized text, a second classifier head may be trained to detect highlighted text, etc.

In an alternate embodiment, the system 100 may employ two encoders 134, one for each class (e.g., bold/non-bold). A distance metric may be used for the latent representation. The two encoders 134 may employ weight sharing, in some examples. During training, the multi-task training may leverage the label (bold/non-bold) together with the distance constraint. After training, the system 100 may use the distance metric to distinguish between the two classes for an input word.

In various examples, a feature that is proportional to the length of the word may be added. The size of the padding of a word (e.g., the horizontal spread of pixels of the characters of the word) may be computed relative to the image height (in pixels). This proportion increases with the length of the word. This number is used as a condition in the VAE and as input for the classifier. This feature may help to make the feature representations generated by the encoder 134 (P(X|T)) independent of the length of the word. The latent representation should not capture information about the relative length of the word since it is not relevant to the classification task.
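One possible reading of the length feature is sketched below: the horizontal extent of the columns that contain character pixels, divided by the image height. The function name `length_feature`, the convention that background pixels equal 0, and the toy image are assumptions for illustration only.

```python
def length_feature(word_image, background=0):
    # word_image: list of pixel rows for a single word.
    height = len(word_image)
    width = len(word_image[0])
    # Columns that contain at least one non-background (character) pixel.
    ink_columns = [
        c for c in range(width)
        if any(row[c] != background for row in word_image)
    ]
    if not ink_columns:
        return 0.0
    # Horizontal spread of the characters, relative to the image height.
    spread = ink_columns[-1] - ink_columns[0] + 1
    return spread / height

# Toy 2x6 word image with character pixels in columns 1 through 4.
word_image = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
]
feature = length_feature(word_image)
print(feature)
```

A longer word occupies more columns at the same height, so the value grows with word length, which is the property the paragraph above relies on.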

FIG. 2 depicts use of statistical techniques for anomaly detection in text, in accordance with various aspects of the present disclosure. In some examples, a static threshold may be used to determine whether the appearance of a word is anomalous with respect to the input body of text. For example, classifier 140 may output scores between 0 and 1. A threshold of 0.85 (or any other suitable value) may be used to distinguish between bolded and non-bolded text, with words associated with scores over 0.85 being classified as bold and words associated with scores under 0.85 being classified as non-bold. This is merely an example; the actual threshold is implementation specific. For example, instead of the example implementation described above, bold words may be those words with classifier 140 scores under 0.5, over 0.5, and/or may be classified on the basis of either being over, equal to, or under any suitable value.

In some other examples, since font size, shape, and appearance may differ from one body of text to another, a cluster-based methodology may be used to classify the text as anomalous/non-anomalous on the basis of the scores 142 output by classifier 140. For example, the mean (e.g., the mean score) and standard deviation σ of the scores 142 may be determined and z-scores may be used to detect outliers corresponding to the anomalous text. In the example of FIG. 2, outliers 202 may be those scores 142 with Z≥2, e.g., scores at least 2σ above the mean (although different thresholds may be used in accordance with the desired implementation). Z-scores may be given by:

$Z = \frac{x - \mu}{\sigma}$

where x is an individual score 142 and μ is the mean of scores 142.

Typically, the ingredients on a product label have the same font, size, color, background, etc., and only the allergens are bolded. In some examples, the font itself may be relatively thick (e.g., relative to other commonly used fonts). In such a case, a simple thresholding technique using a static threshold for system 100 may determine that all the words in such a label are bold since all words may have scores that exceed the static threshold. The outlier detection methodology (e.g., z-score) described above avoids such scenarios. In various examples, the subset of scores with a z-score that exceeds a pre-determined value (e.g., >2σ) may be determined to be anomalous with respect to at least some other words of the input text.
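The contrast between a static threshold and z-score outlier detection in the thick-font scenario can be sketched as follows. The score values are made up for illustration, the population standard deviation is an assumed choice, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def static_threshold(scores, threshold=0.85):
    # Flags every word whose score exceeds a fixed value.
    return [i for i, s in enumerate(scores) if s > threshold]

def zscore_outliers(scores, z_threshold=2.0):
    # Z = (x - mu) / sigma, computed relative to the scores of the same label.
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return []  # all scores identical: nothing stands out
    return [i for i, s in enumerate(scores) if (s - mu) / sigma >= z_threshold]

# Assumed scores for a label printed in a relatively thick font:
# every word scores high, but only the last word is actually bolded.
thick_font_scores = [0.88, 0.90, 0.89, 0.91, 0.90, 0.89,
                     0.88, 0.90, 0.91, 0.89, 0.90, 0.995]
flagged_static = static_threshold(thick_font_scores)
flagged_z = zscore_outliers(thick_font_scores)
print(flagged_static)
print(flagged_z)
```

With these illustrative scores, the static threshold flags all twelve words, while the z-score rule flags only the final word, because its score is measured against the other words on the same label rather than against a fixed value.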

FIG. 3 depicts an example of training of an example system effective to perform anomaly detection in text, in accordance with various aspects of the present disclosure. As previously described, the encoder 134, decoder 138, and classifier 140 may be trained simultaneously using end-to-end training 310 based on a weighted sum of classifier loss 148 and reconstruction loss 150:

$L_{\mathrm{total}} = w_1 \, \mathrm{MSE}_{\mathrm{autoencoder}} + w_2 \, \mathrm{BCE}_{\mathrm{classifier}}$

where w₁ and w₂ are scalar weights, $\mathrm{MSE}_{\mathrm{autoencoder}}$ is the mean squared error loss for the autoencoding task, and $\mathrm{BCE}_{\mathrm{classifier}}$ is the binary cross-entropy loss for the classification task. Using this multi-task training helps the encoder 134 to learn feature representations (e.g., conditional distributions) that represent useful features of the input words for the text anomaly detection task and helps to prevent over-fitting to the training data. For example, using traditional classifiers and/or encoders, features may be learned that represent the semantic meaning of the word. Accordingly, the next time the network sees a word that it has learned is an anomaly, the network may classify the word as anomalous regardless of how the font of the word appears with respect to the body of text.
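The weighted combination of the two losses can be worked through numerically with a short Python sketch. It treats a single training example with a flattened pixel list and a single classifier score; the function names, default weights, clipping constant, and sample values are assumptions for illustration and do not reflect the actual training configuration.

```python
import math

def mse(x, x_hat):
    # Mean squared error between input pixels and reconstructed pixels.
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def bce(label, score, eps=1e-7):
    # Binary cross-entropy for one example; score is clipped to avoid log(0).
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def total_loss(x, x_hat, label, score, w1=1.0, w2=1.0):
    # Weighted sum of the reconstruction loss and the classification loss.
    return w1 * mse(x, x_hat) + w2 * bce(label, score)

x = [0.0, 1.0]       # assumed input pixels
x_hat = [0.0, 0.5]   # assumed reconstruction
loss = total_loss(x, x_hat, label=1, score=0.5, w1=2.0, w2=1.0)
print(loss)
```

Raising `w1` relative to `w2` emphasizes faithful reconstruction, while raising `w2` emphasizes the attribute classification task.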

FIG. 4 depicts an example process 400 that may be used to perform anomaly detection in text, in accordance with various aspects of the present disclosure. Those actions in FIG. 4 that have been previously described in reference to FIGS. 1-3 may not be described again herein for purposes of clarity and brevity. The actions of the process depicted in the flow diagram of FIG. 4 may represent a series of instructions comprising computer-readable machine code executable by one or more processing units of one or more computing devices. In various examples, the computer-readable machine codes may be comprised of instructions selected from a native instruction set of and/or an operating system (or systems) of the one or more computing devices. Although the figures and discussion illustrate certain operational steps of the system in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure.

Process 400 may begin at action 402, at which first text data comprising a plurality of words may be received. In various examples, the first text data may be received as an image of text data (e.g., an image of a paragraph of text, an image of a product label, etc.).

Processing may continue at action 404, at which an image of a first word of the plurality of words may be detected. In various examples, the words of the plurality of words may be separated into images of individual words. As previously described, individual words of the body of words may be detected using OCR to separate the words on the basis of spaces between characters.

Processing may continue at action 406, at which a feature representation of the first word may be determined. In various examples, a VAE may be used to generate the feature representation. The feature representation may comprise a conditional distribution P(X|T) for the input word mapped to a latent distribution learned by the VAE during multi-task training. The mean parameter of the conditional distribution P(X|T) may be determined and/or a point from the conditional distribution P(X|T) may be sampled for input into a classifier.

Processing may continue at action 408, at which a score may be generated based at least in part on the feature representation. The score may indicate a likelihood (e.g., a probability) that an appearance of the first word in the image of the first word is anomalous with respect to at least some other words of the plurality of words. For example, a sigmoid classifier and/or other regression-based classifier may be used to generate scores 142 which may, in turn, be used to classify the input word (e.g., the first word of the plurality of words) as either anomalous with respect to other words of the input text, or non-anomalous.

Processing may continue at action 410, at which output data indicating that the first word is of a first class may be generated. The output data may be a classification of the input word (e.g., the first word of action 404). Various techniques may be used to generate the output data that classifies the input word, as described herein. For example, a static threshold may be used and/or a clustering algorithm may be used to determine whether the score is an outlier. Accordingly, the score (and input word) may be classified on the basis of its z-score (or similar). For example, the classifier may be used to determine whether a word is bolded, italicized, highlighted, underlined, of a different font with respect to the other words of the input text, etc. Additionally, in some examples, multiple classifier heads may be used to detect various different types of anomalies in the input text (e.g., italics, boldness, etc.).

FIG. 5 is a block diagram showing an example architecture 500 of a computing device that may be used to instantiate the text anomaly detection techniques described herein. For example, architecture 500 of a computing device may be effective to implement various machine learning models such as the VAEs, classifiers, etc., described herein, in accordance with various aspects of the present disclosure. It will be appreciated that not all devices will include all of the components of the architecture 500 and some user devices may include additional components not shown in the architecture 500. The architecture 500 may include one or more processing elements 504 for executing instructions and retrieving data stored in a storage element 502. The processing element 504 may comprise at least one processor. Any suitable processor or processors may be used. For example, the processing element 504 may comprise one or more digital signal processors (DSPs). The storage element 502 can include one or more different types of memory, data storage, or computer-readable storage media devoted to different purposes within the architecture 500. For example, the storage element 502 may comprise flash memory, random-access memory, disk-based storage, etc. Different portions of the storage element 502, for example, may be used for program instructions for execution by the processing element 504, storage of images or other digital works, and/or a removable storage for transferring data to other devices, etc. Additionally, storage element 502 may store parameters, and/or machine learning models generated using the various techniques described herein.

The storage element 502 may also store software for execution by the processing element 504. An operating system 522 may provide the user with an interface for operating the computing device and may facilitate communications and commands between applications executing on the architecture 500 and various hardware thereof. A transfer application 524 may be configured to receive images, audio, and/or video from another device (e.g., a mobile device, image capture device, and/or display device) or from an image sensor 532 and/or microphone 570 included in the architecture 500.

When implemented in some user devices, the architecture 500 may also comprise a display component 506. The display component 506 may comprise one or more light-emitting diodes (LEDs) or other suitable display lamps. Also, in some examples, the display component 506 may comprise, for example, one or more devices such as cathode ray tubes (CRTs), liquid-crystal display (LCD) screens, gas plasma-based flat panel displays, LCD projectors, raster projectors, infrared projectors or other types of display devices, etc. As described herein, display component 506 may be effective to display suggested personalized search queries generated in accordance with the various techniques described herein.

The architecture 500 may also include one or more input devices 508 operable to receive inputs from a user. The input devices 508 can include, for example, a push button, touch pad, touch screen, wheel, joystick, keyboard, mouse, trackball, keypad, light gun, game controller, or any other such device or element whereby a user can provide inputs to the architecture 500. These input devices 508 may be incorporated into the architecture 500 or operably coupled to the architecture 500 via wired or wireless interface. In some examples, architecture 500 may include a microphone 570 or an array of microphones for capturing sounds, such as voice requests. In various examples, audio captured by microphone 570 may be streamed to external computing devices via communication interface 512.

When the display component 506 includes a touch-sensitive display, the input devices 508 can include a touch sensor that operates in conjunction with the display component 506 to permit users to interact with the image displayed by the display component 506 using touch inputs (e.g., with a finger or stylus). The architecture 500 may also include a power supply 514, such as a wired alternating current (AC) converter, a rechargeable battery operable to be recharged through conventional plug-in approaches, or through other approaches such as capacitive or inductive charging.

The communication interface 512 may comprise one or more wired or wireless components operable to communicate with one or more other computing devices. For example, the communication interface 512 may comprise a wireless communication module 536 configured to communicate on a network, such as the network 604, according to any suitable wireless protocol, such as IEEE 802.11 or another suitable wireless local area network (WLAN) protocol. A short range interface 534 may be configured to communicate using one or more short range wireless protocols such as, for example, near field communications (NFC), Bluetooth, Bluetooth LE, etc. A mobile interface 540 may be configured to communicate utilizing a cellular or other mobile protocol. A Global Positioning System (GPS) interface 538 may be in communication with one or more earth-orbiting satellites or other suitable position-determining systems to identify a position of the architecture 500. A wired communication module 542 may be configured to communicate according to the USB protocol or any other suitable protocol.

The architecture 500 may also include one or more sensors 530 such as, for example, one or more position sensors, image sensors, and/or motion sensors. An image sensor 532 is shown in FIG. 5. Some examples of the architecture 500 may include multiple image sensors 532. For example, a panoramic camera system may comprise multiple image sensors 532 resulting in multiple images and/or video frames that may be stitched and may be blended to form a seamless panoramic output. An example of an image sensor 532 may be a camera configured to capture color information, image geometry information, and/or ambient light information.

As noted above, multiple devices may be employed in a single system. In such a multi-device system, each of the devices may include different components for performing different aspects of the system's processing. The multiple devices may include overlapping components. The components of the computing devices, as described herein, are exemplary, and may be located as a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.

An example system for sending and providing data will now be described in detail. In particular, FIG. 6 illustrates an example computing environment in which the embodiments described herein may be implemented. For example, the computing environment of FIG. 6 may be used to provide the various machine learning models described herein as a service over a network wherein one or more of the techniques described herein may be requested by a first computing device and may be performed by a different computing device configured in communication with the first computing device over a network. FIG. 6 is a diagram schematically illustrating an example of a data center 65 that can provide computing resources to users 60a and 60b (which may be referred herein singularly as user 60 or in the plural as users 60) via user computers 62a and 62b (which may be referred herein singularly as user computer 62 or in the plural as user computers 62) via network 104. Data center 65 may be configured to provide computing resources for executing applications on a permanent or an as-needed basis. The computing resources provided by data center 65 may include various types of resources, such as gateway resources, load balancing resources, routing resources, networking resources, computing resources, volatile and non-volatile memory resources, content delivery resources, data processing resources, data storage resources, data communication resources and the like. Each type of computing resource may be available in a number of specific configurations. For example, data processing resources may be available as virtual machine instances that may be configured to provide various web services. In addition, combinations of resources may be made available via a network and may be configured as one or more web services.
The instances may be configured to execute applications, including web services, such as application services, media services, database services, processing services, gateway services, storage services, routing services, security services, encryption services, load balancing services and the like. In various examples, the instances may be configured to execute one or more of the various machine learning techniques described herein.

These services may be configurable with set or custom applications and may be configurable in size, execution, cost, latency, type, duration, accessibility and in any other dimension. These web services may be configured as available infrastructure for one or more clients and can include one or more applications configured as a system or as software for one or more clients. These web services may be made available via one or more communications protocols. These communications protocols may include, for example, hypertext transfer protocol (HTTP) or non-HTTP protocols. These communications protocols may also include, for example, more reliable transport layer protocols, such as transmission control protocol (TCP), and less reliable transport layer protocols, such as user datagram protocol (UDP). Data storage resources may include file storage devices, block storage devices and the like.

Each type or configuration of computing resource may be available in different sizes, such as large resources (consisting of many processors, large amounts of memory and/or large storage capacity) and small resources (consisting of fewer processors, smaller amounts of memory and/or smaller storage capacity). Customers may choose to allocate a number of small processing resources as web servers and/or one large processing resource as a database server, for example.

Data center 65 may include servers 66a and 66b (which may be referred herein singularly as server 66 or in the plural as servers 66) that provide computing resources. These resources may be available as bare metal resources or as virtual machine instances 68a-d (which may be referred herein singularly as virtual machine instance 68 or in the plural as virtual machine instances 68). In at least some examples, server manager 67 may control operation of and/or maintain servers 66. Virtual machine instances 68c and 68d are rendition switching virtual machine (“RSVM”) instances. The RSVM virtual machine instances 68c and 68d may be configured to perform all, or any portion, of the techniques for improved rendition switching and/or any other of the disclosed techniques in accordance with the present disclosure and described in detail above. As should be appreciated, while the particular example illustrated in FIG. 6 includes one RSVM virtual machine in each server, this is merely an example. A server may include more than one RSVM virtual machine or may not include any RSVM virtual machines.

The availability of virtualization technologies for computing hardware has afforded benefits for providing large-scale computing resources for customers and allowing computing resources to be efficiently and securely shared between multiple customers. For example, virtualization technologies may allow a physical computing device to be shared among multiple users by providing each user with one or more virtual machine instances hosted by the physical computing device. A virtual machine instance may be a software emulation of a particular physical computing system that acts as a distinct logical computing system. Such a virtual machine instance provides isolation among multiple operating systems sharing a given physical computing resource. Furthermore, some virtualization technologies may provide virtual resources that span one or more physical resources, such as a single virtual machine instance with multiple virtual processors that span multiple distinct physical computing systems.

Referring to FIG. 6, network 104 may, for example, be a publicly accessible network of linked networks and possibly operated by various distinct parties, such as the Internet. In other embodiments, network 104 may be a private network, such as a corporate or university network that is wholly or partially inaccessible to non-privileged users. In still other embodiments, network 104 may include one or more private networks with access to and/or from the Internet.

Network 104 may provide access to user computers 62. User computers 62 may be computers utilized by users 60 or other customers of data center 65. For instance, user computer 62a or 62b may be a server, a desktop or laptop personal computer, a tablet computer, a wireless telephone, a personal digital assistant (PDA), an e-book reader, a game console, a set-top box or any other computing device capable of accessing data center 65. User computer 62a or 62b may connect directly to the Internet (e.g., via a cable modem or a Digital Subscriber Line (DSL)). Although only two user computers 62a and 62b are depicted, it should be appreciated that there may be multiple user computers.

User computers 62 may also be utilized to configure aspects of the computing resources provided by data center 65. In this regard, data center 65 might provide a gateway or web interface through which aspects of its operation may be configured through the use of a web browser application program executing on user computer 62. Alternately, a stand-alone application program executing on user computer 62 might access an application programming interface (API) exposed by data center 65 for performing the configuration operations. Other mechanisms for configuring the operation of various web services available at data center 65 might also be utilized.

Servers 66 shown in FIG. 6 may be servers configured appropriately for providing the computing resources described above and may provide computing resources for executing one or more web services and/or applications. In one embodiment, the computing resources may be virtual machine instances 68. In the example of virtual machine instances, each of the servers 66 may be configured to execute an instance manager 63a or 63b (which may be referred herein singularly as instance manager 63 or in the plural as instance managers 63) capable of executing the virtual machine instances 68. The instance managers 63 may be a virtual machine monitor (VMM) or another type of program configured to enable the execution of virtual machine instances 68 on server 66, for example. As discussed above, each of the virtual machine instances 68 may be configured to execute all or a portion of an application.

It should be appreciated that although the embodiments disclosed above discuss the context of virtual machine instances, other types of implementations can be utilized with the concepts and technologies disclosed herein. For example, the embodiments disclosed herein might also be utilized with computing systems that do not utilize virtual machine instances.

In the example data center 65 shown in FIG. 6, a router 61 may be utilized to interconnect the servers 66a and 66b. Router 61 may also be connected to gateway 64, which is connected to network 104. Router 61 may be connected to one or more load balancers, and alone or in combination may manage communications within networks in data center 65, for example, by forwarding packets or other data communications as appropriate based on characteristics of such communications (e.g., header information including source and/or destination addresses, protocol identifiers, size, processing requirements, etc.) and/or the characteristics of the private network (e.g., routes based on network topology, etc.). It will be appreciated that, for the sake of simplicity, various aspects of the computing systems and other devices of this example are illustrated without showing certain conventional details. Additional computing systems and other devices may be interconnected in other embodiments and may be interconnected in different ways.

In the example data center 65 shown in FIG. 6, a data center 65 is also employed to at least in part direct various communications to, from and/or between servers 66a and 66b. While FIG. 6 depicts router 61 positioned between gateway 64 and data center 65, this is merely an exemplary configuration. In some cases, for example, data center 65 may be positioned between gateway 64 and router 61. Data center 65 may, in some cases, examine portions of incoming communications from user computers 62 to determine one or more appropriate servers 66 to receive and/or process the incoming communications. Data center 65 may determine appropriate servers to receive and/or process the incoming communications based on factors such as an identity, location or other attributes associated with user computers 62, a nature of a task with which the communications are associated, a priority of a task with which the communications are associated, a duration of a task with which the communications are associated, a size and/or estimated resource usage of a task with which the communications are associated and many other factors. Data center 65 may, for example, collect or otherwise have access to state information and other information associated with various tasks in order to, for example, assist in managing communications and other operations associated with such tasks.

It should be appreciated that the network topology illustrated in FIG. 6 has been greatly simplified and that many more networks and networking devices may be utilized to interconnect the various computing systems disclosed herein. These network topologies and devices should be apparent to those skilled in the art.

It should also be appreciated that data center 65 described in FIG. 6 is merely illustrative and that other implementations might be utilized. It should also be appreciated that a server, gateway or other computing device may comprise any combination of hardware or software that can interact and perform the described types of functionality, including without limitation: desktop or other computers, database servers, network storage devices and other network devices, PDAs, tablets, cellphones, wireless phones, pagers, electronic organizers, Internet appliances, television-based systems (e.g., using set top boxes and/or personal/digital video recorders) and various other consumer products that include appropriate communication capabilities.

A network set up by an entity, such as a company or a public sector organization, to provide one or more web services (such as various types of cloud-based computing or storage) accessible via the Internet and/or other networks to a distributed set of clients may be termed a provider network. Such a provider network may include numerous data centers hosting various resource pools, such as collections of physical and/or virtualized computer servers, storage devices, networking equipment and the like, used to implement and distribute the infrastructure and web services offered by the provider network. The resources may in some embodiments be offered to clients in various units related to the web service, such as an amount of storage capacity for storage, processing capability for processing, as instances, as sets of related services, and the like. A virtual computing instance may, for example, comprise one or more servers with a specified computational capacity (which may be specified by indicating the type and number of CPUs, the main memory size and so on) and a specified software stack (e.g., a particular version of an operating system, which may in turn run on top of a hypervisor).

A number of different types of computing devices may be used singly or in combination to implement the resources of the provider network in different embodiments, for example, computer servers, storage devices, network devices, and the like. In some embodiments, a client or user may be provided direct access to a resource instance, e.g., by giving a user an administrator login and password. In other embodiments, the provider network operator may allow clients to specify execution requirements for specified client applications and schedule execution of the applications on behalf of the client on execution systems (such as application server instances, Java™ virtual machines (JVMs), general-purpose or special-purpose operating systems that support various interpreted or compiled programming languages such as Ruby, Perl, Python, C, C++, and the like, or high-performance computing systems) suitable for the applications, without, for example, requiring the client to access an instance or an execution system directly. A given execution system may utilize one or more resource instances in some implementations; in other implementations, multiple execution systems may be mapped to a single resource instance.

In many environments, operators of provider networks that implement different types of virtualized computing, storage and/or other network-accessible functionality may allow customers to reserve or purchase access to resources in various resource acquisition modes. The computing resource provider may provide facilities for customers to select and launch the desired computing resources, deploy application components to the computing resources and maintain an application executing in the environment. In addition, the computing resource provider may provide further facilities for the customer to quickly and easily scale up or scale down the numbers and types of resources allocated to the application, either manually or through automatic scaling, as demand for or capacity requirements of the application change. The computing resources provided by the computing resource provider may be made available in discrete units, which may be referred to as instances. An instance may represent a physical server hardware system, a virtual machine instance executing on a server or some combination of the two. Various types and configurations of instances may be made available, including different sizes of resources executing different operating systems (OS) and/or hypervisors, and with various installed software applications, runtimes and the like. Instances may further be available in specific availability zones, representing a logical region, a fault tolerant region, a data center or other geographic location of the underlying computing hardware, for example. Instances may be copied within an availability zone or across availability zones to improve the redundancy of the instance, and instances may be migrated within a particular availability zone or across availability zones.
As one example, the latency for client communications with a particular server in an availability zone may be less than the latency for client communications with a different server. As such, an instance may be migrated from the higher latency server to the lower latency server to improve the overall client experience.

In some embodiments, the provider network may be organized into a plurality of geographical regions, and each region may include one or more availability zones. An availability zone (which may also be referred to as an availability container) in turn may comprise one or more distinct locations or data centers, configured in such a way that the resources in a given availability zone may be isolated or insulated from failures in other availability zones. That is, a failure in one availability zone may not be expected to result in a failure in any other availability zone. Thus, the availability profile of a resource instance is intended to be independent of the availability profile of a resource instance in a different availability zone. Clients may be able to protect their applications from failures at a single location by launching multiple application instances in respective availability zones. At the same time, in some implementations inexpensive and low latency network connectivity may be provided between resource instances that reside within the same geographical region (and network transmissions between resources of the same availability zone may be even faster).

FIG. 7 is a block diagram illustrating another example system 700 effective to perform anomaly detection in text, according to various aspects of the present disclosure. The system 700 can be summarized to include the following computational steps: given an image including text, (a) optical character recognition (OCR) is performed to identify all the words, and (b) the extracted word-based image regions are grouped into sequences and passed through a system comprising a ResNet18 backbone feature extractor (or other feature extractor), followed by a transformer head that reasons over the word types.

Transformers are often used in natural language processing (NLP) tasks. Their core component is attention. This mechanism has been successful in NLP tasks such as machine translation and has proved to be superior in leveraging context for sequence-to-sequence modelling.

Modern networks employ transformer-like architectures that account for information from different representational states, thus eliminating the bottleneck effect caused by the usual encoding-decoding schemes. This bottleneck is mainly caused by the attempt to project the original input into a lower dimensional space. The compression due to encoding might cause important information loss, since it is very difficult in practice to verify the assumption that the input data can be successfully projected into a fixed intermediate vector. In order to decide which piece of information is relevant with respect to the context of the input, an attention system is used.

In the system of FIG. 7, instead of directly linking attention to convolutional layers so that it taps into pixel information (or, more generally, instead of building custom encoder layers that should learn situational embeddings), a basic architecture is used only for learning a local statistic (e.g., differentiating between bold/non-bold words or learning other context dependent information). The convolutional neural network backbone learns to extract relevant features (e.g., letter thickness) which are then used for comparison. As a result, the transformer head does not need to utilize a large number of parameters and does not need to directly search for hints and focus on particular pieces of information. System 700 is modular and can be customized with any feature extractor in context dependent tasks where inputs need to be compared with each other.

In system 700, an image containing text is provided as input. First, the bounding boxes containing words are extracted using an OCR component. Next, the word-level images are passed through a feature extractor Φ as a sequence to obtain specialized features for the bold versus non-bold problem. As described below, system 700 may be used to perform other context-dependent tasks, such as identifying deviation from proper exercise techniques, distinguishing players from referees in a soccer match (without prior knowledge of uniform disparities), etc. The obtained features are then passed through a transformer module Γ which learns an implicit statistic between the elements of the sequence. Finally, the features are processed by a decoding module Ψ to obtain the final class estimates for each word-level image patch.

Methodology

Given an image I containing text, the goal is to classify all the word-level image patches as either bold or non-bold. For this task, grayscale images are used, as the color component is invariant for the problem at hand. Because the example implementation addresses a text-based problem, an OCR algorithm is used to propose the word regions. The aim is thus to emphasize the capabilities of the transformer head for the problem at hand, on top of a predefined pool of predictions. However, the approach can be used with any region proposal mechanism.

After applying the OCR algorithm, a pool of N word-level image patches is generated for image I. These N word-level image patches are denoted by $P = \{P_i \mid i \in \{1, \dots, N\}\}$, where N is the number of words from image I. For each $P_i$ there is a corresponding binary label attached, marking it as either bold or non-bold. The attached array of labels is denoted by $L \in \{0,1\}^N$. Thus, the training samples comprise pairs $(P_j, L_j)$, where $j = 1, \dots, N_{images}$ and $N_{images}$ is the total number of data samples. For ease of notation and understanding, $P_j$ and $L_j$ are referred to as P and L, respectively.

For each word-level image patch $P_i \in P$, a simple foreground-background segmentation is applied to obtain the letter-level segmentation mask, denoted by $S_i$ with $i = 1, \dots, N$. Each pixel corresponding to background is 0 and each pixel corresponding to the letter content is 1. For each mask $S_i$, two morphological operations are applied: (1) skeletonization of the foreground mask, denoted by $S_i^{skel}$, and (2) a distance transform applied on the interior of the foreground mask, denoted by $S_i^{dist}$. The intuition behind applying these two types of operations is that $S_i^{skel}$ should provide the innermost pixels with respect to the letter shape, and $S_i^{dist}$ gives the distance to the closest edge of the letter for those pixels.

As a result, for each word, a thickness measurement is obtained in the form of

$$\theta(P_i) = \{ S_i^{dist}(x, y) \mid S_i^{skel}(x, y) = 1 \}, \quad i = 1, \dots, N \tag{1}$$

In other words, $\theta(P_i)$ provides a thickness measurement computed over all skeleton pixels of a word. If the $\theta(P_i)$ are merged at the image level, a thickness measurement is obtained that is aggregated over the entire image I. This image measurement is denoted by

$$\theta_I = \{ \theta(P_i) \mid i = 1, \dots, N \} \tag{2}$$

Having a global thickness measurement over the image I, in the form of θ_I, and the prior knowledge that ≈10% of the words are bold, the following voting scheme is used to determine if a word is bold or not,

$$B(P_i) = \begin{cases} 0 & \text{if } \mu(\theta(P_i)) > \mathrm{med}(\theta_I) + \alpha \cdot \sigma(\theta_I) \\ 1 & \text{if } \mu(\theta(P_i)) \leq \mathrm{med}(\theta_I) + \alpha \cdot \sigma(\theta_I) \end{cases} \tag{3}$$

where μ(·), σ(·) and med(·) represent the mean, the standard deviation and the median function, respectively. Parameter α is a voting threshold value which requires validation.
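The comparison at the heart of the voting scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `exceeds_thickness_threshold`, the default `alpha=1.0`, the use of the population standard deviation, and the toy thickness values are all hypothetical. The function only reports whether a word's mean thickness exceeds the image-level threshold; mapping that comparison to the label B(P_i) follows equation (3).

```python
from statistics import mean, median, pstdev


def exceeds_thickness_threshold(word_thickness, alpha=1.0):
    """Test mu(theta(P_i)) > med(theta_I) + alpha * sigma(theta_I) per word.

    word_thickness[i] holds the distance-transform values sampled at the
    skeleton pixels of word i, i.e. theta(P_i). alpha is the voting
    threshold (hypothetical default; it requires validation).
    """
    # theta_I: thickness samples pooled over every word in the image.
    theta_image = [v for word in word_thickness for v in word]
    # Assumption: sigma is taken as the population standard deviation.
    threshold = median(theta_image) + alpha * pstdev(theta_image)
    return [mean(word) > threshold for word in word_thickness]


# Toy data (hypothetical): three thin words and one thick word.
words = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [3.0, 3.0, 3.0, 3.0],
]
flags = exceeds_thickness_threshold(words, alpha=1.0)
```

Because the median is used as the reference point, a small fraction of thick words shifts the threshold far less than it would shift a mean-based reference, which fits the prior that only about 10% of words are bold.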

Region-based Image Classifier

Another straightforward approach to this problem is to build a classifier model, where a training sample is a word-level image patch together with its corresponding binary label, (P_(j), L_(j)), independent of the image from which it came.

For this purpose, a word-level image-based model is constructed that disentangles the visual representation of the words into having a discriminative feature for the task of bold versus non-bold image classification. A multi-task ensemble may be used as an approach to solving the problem:

A) a variational autoencoder (VAE) trained on letter-level image patches, which reconstructs an input image.

B) a classification head Ω placed on top of the VAE latent representation trained for the bold versus non-bold classification task.

The intuition behind this is that the VAE model should capture the appearance of each word in terms of font style and font attributes, while the classification head constrains it to specialize for the task of bold versus non-bold word detection.

For each of the word-level images P_(j) from an image I, a score between 0 and 1 is obtained using the Ω model. In order to obtain the bold words with respect to image I, two methods may be used: (1) select every word with a score above a certain threshold or (2) perform a clustering-based strategy on the obtained scores using z-score statistics. As previously described, the second approach may be advantageous for classifying accurately in the face of font/typeface differences. This is achieved by computing the mean and standard deviation of the set of scores and then selecting those scores that lie farthest from the mean. In effect, this acts as an outlier detector for those word-level images which stand out with respect to the rest from I.
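A minimal Python sketch of the z-score strategy is given below. The function name `zscore_outliers`, the cutoff `z_threshold=2.0`, the use of the population standard deviation, and the example scores are hypothetical choices made for illustration; the source does not specify them.

```python
from statistics import mean, pstdev


def zscore_outliers(scores, z_threshold=2.0):
    """Flag scores whose distance from the mean exceeds z_threshold sigmas.

    scores are the per-word outputs of the classification head for one
    image. z_threshold is a hypothetical cutoff that would need tuning.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All words look alike: nothing stands out in this image.
        return [False] * len(scores)
    return [abs(s - mu) / sigma > z_threshold for s in scores]


# Hypothetical scores for ten words from one image; the last one stands out.
scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10, 0.12, 0.85]
outliers = zscore_outliers(scores)
```

Since the statistics are computed per image, a word is flagged relative to the other words around it rather than against a fixed global threshold.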

The main issue with this approach is that it learns to discriminate between word patches across the entire dataset. However, given the observation from 2, this may be problematic and the model may become confused, thus assigning uncertain scores. For this reason, it is important for this particular task to integrate the local image context into the learning task.

Context Sensitive Transformer

Given the previously defined sequence P of word patches, a model should learn their corresponding labelings L. This implies learning to distinguish between them locally, not globally. To accomplish this particular task, the proposed transformer model of system 700 incorporates three distinctive components (I, II and III):

(I): Φ is a convolutional network designed for classification which acts as a feature extractor. It maps a sequence of word patches $P = \{P_i \mid i \in \{1, \dots, N\}\}$ to a sequence of embeddings $E = \{E_i \mid i \in \{1, \dots, N\}\}$. For this purpose, a lightweight image classifier may be used. In various examples, the image classifier may be pre-trained on ImageNet (or some other dataset) and may be specialized for the desired use case. The reason behind this is that the model is looking only at crops of words and the goal is to determine low-level image features that are discriminative for the task of bold versus non-bold word detection. In practice, the model complexity should be adjusted according to the task at hand. In an example implementation, the ResNet18 backbone network may be used as a featurizer. The last classification layer may be discarded and a 1×1 convolution may be added in order to match the output dimensionality of Φ to the input of the transformer module.

$$\Phi(P) = E \tag{4}$$

(II): Γ, a light transformer head that performs attention over the sequence of embeddings E and outputs another sequence $\hat{E}_t = \{\hat{E}_i^t \mid i = 1, \dots, N;\ t = 1, \dots, T\}$ as a result of applying attention. The transformer head may comprise T stacked encoding layers, where each encoding layer comprises multi-head attention, layer normalization, and feed-forward layers together with some residual-block connections. In various examples, the positional encoding may be dropped if it is not relevant to the task at hand (e.g., whether a word is bold or non-bold does not depend on its location in the image). It may be desirable to let the transformer account for every patch independently of its position and leverage their Φ(P) feature representation to learn an implicit statistic between the sequence elements with respect to binary classification given the local context.

$$\hat{E}_0 = \Gamma(\Phi(P)) \tag{5}$$

$$\hat{E}_t = \Gamma(\hat{E}_{t-1}) \tag{6}$$

(III): Ψ, a final decoding step that maps the transformer embedding to the desired output representation. It may comprise a fully connected layer followed by a softmax layer that maps each element of $\hat{E}_t$ to a two-dimensional vector representing probabilities for each class (non-bold and bold, respectively). The resulting predictions for the N word patches are denoted by $\hat{L}$.

$$\mathrm{CONSENT}(P) = \Psi(\Gamma(\Phi(P))) = \hat{L} \tag{7}$$
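The decoding step can be illustrated with a small pure-Python sketch of a fully connected layer followed by a softmax. Everything concrete here is an assumption made for illustration: the function name `decode`, the two-dimensional embeddings, and the hand-picked weights and biases stand in for parameters that would be learned during training.

```python
import math


def decode(embeddings, weights, bias):
    """Fully connected layer plus softmax, applied to each embedding.

    embeddings: list of N embedding vectors (output of the transformer head).
    weights: two rows (non-bold, bold), one weight per embedding dimension.
    bias: two bias terms. All parameter values are hypothetical.
    Returns N pairs [p_non_bold, p_bold].
    """
    out = []
    for e in embeddings:
        logits = [sum(w * x for w, x in zip(row, e)) + b
                  for row, b in zip(weights, bias)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


# Hypothetical parameters: the first embedding dimension drives the bold logit.
weights = [[-1.0, 0.0], [1.0, 0.0]]
bias = [0.0, 0.0]
probs = decode([[2.0, 0.0], [-2.0, 0.0]], weights, bias)
```

Each output pair sums to one, so the second entry can be read directly as the predicted probability that the corresponding word patch is bold.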

Now that all the components of the model are defined, an objective function may be defined to optimize the model's parameters. Traditionally, for this type of task a cross-entropy loss is used. However, given the fact that the bold/non-bold use case deals with a class imbalance situation, focal loss may be used for this particular use case. This is a variation of the cross-entropy loss designed for situations in which there is a class imbalance. In the bold/non-bold scenario, the loss function may be the following:

$$L_{\Gamma}(L, \hat{L}) = -\sum_i (1 - L_i^t)^{\gamma} \log(L_i^t) \tag{8}$$

where

$$L_i^t = \begin{cases} \hat{L}_i & \text{if } L_i = 1 \\ 1 - \hat{L}_i & \text{otherwise} \end{cases} \tag{9}$$
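Equations (8) and (9) can be worked through with a short Python sketch. The function name `focal_loss`, the default `gamma=2.0`, and the clamping constant `eps` are assumptions made for illustration; `gamma` stands for the exponent in equation (8), and `probs_bold` holds the predicted probability of the bold class for each word.

```python
import math


def focal_loss(labels, probs_bold, gamma=2.0, eps=1e-12):
    """Sum of focal-loss terms over the words of one sequence.

    labels: ground-truth labels, 1 for bold and 0 for non-bold.
    probs_bold: predicted probability of the bold class per word.
    gamma: exponent of equation (8) (hypothetical default).
    """
    total = 0.0
    for y, p in zip(labels, probs_bold):
        # Equation (9): probability assigned to the true class.
        p_t = p if y == 1 else 1.0 - p
        p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
        # Equation (8): confident correct predictions are down-weighted.
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total


# Same predictions scored with and without the modulating factor.
plain_ce = focal_loss([1, 0], [0.5, 0.5], gamma=0.0)
focal = focal_loss([1, 0], [0.5, 0.5], gamma=2.0)
```

With `gamma=0` the modulating factor equals one and the expression reduces to the ordinary cross-entropy loss, which makes the effect of the exponent easy to compare.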

Attention can be described as computing alignments of different vectors so as to find relationships between them. Although there exist multiple ways of computing such alignments, the specific attention this model uses is scaled dot product attention. Formally, let Q, K and V denote queries, keys and values. The goal is to match the inputs Q and K of dimension $d_k$ to the output V of dimension $d_v$. To compute the resulting weights of the values, the alignment of each query and key is calculated using a dot product, a normalization constant $\sqrt{d_k}$ and a final softmax function. The resulting weights are then multiplied with V.

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_k}} \right) V \tag{10}$$

In the anomalous text use case, Q, K and V are all equal; therefore, self-attention is used.
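Equation (10) can be sketched in plain Python as follows. This is an illustrative, unoptimized version that uses nested lists instead of a tensor library; the function names and the toy embeddings are hypothetical, and the same sequence is passed as Q, K and V to mirror the self-attention case described above.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Att(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Alignment of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Self-attention over three toy word embeddings (hypothetical values).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(E, E, E)
```

Each output vector is a convex combination of the input embeddings, so every word's representation is updated with information from the other words in the same sequence.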

Although various systems described herein may be embodied in software or code executed by general purpose hardware as discussed above, as an alternate the same may also be embodied in dedicated hardware or a combination of software/general purpose hardware and dedicated hardware. If embodied in dedicated hardware, each can be implemented as a circuit or state machine that employs any one of or a combination of a number of technologies. These technologies may include, but are not limited to, discrete logic circuits having logic gates for implementing various logic functions upon an application of one or more data signals, application specific integrated circuits having appropriate logic gates, or other components, etc. Such technologies are generally well known by those of ordinary skill in the art and consequently, are not described in detail herein.

The flowcharts and methods described herein show the functionality and operation of various implementations. If embodied in software, each block or step may represent a module, segment, or portion of code that comprises program instructions to implement the specified logical function(s). The program instructions may be embodied in the form of source code that comprises human-readable statements written in a programming language or machine code that comprises numerical instructions recognizable by a suitable execution system such as a processing component in a computer system. If embodied in hardware, each block may represent a circuit or a number of interconnected circuits to implement the specified logical function(s).

Although the flowcharts and methods described herein may describe a specific order of execution, it is understood that the order of execution may differ from that which is described. For example, the order of execution of two or more blocks or steps may be scrambled relative to the order described. Also, two or more blocks or steps may be executed concurrently or with partial concurrence. Further, in some embodiments, one or more of the blocks or steps may be skipped or omitted. It is understood that all such variations are within the scope of the present disclosure.

Also, any logic or application described herein that comprises software or code can be embodied in any non-transitory computer-readable medium or memory for use by or in connection with an instruction execution system such as a processing component in a computer system. In this sense, the logic may comprise, for example, statements including instructions and declarations that can be fetched from the computer-readable medium and executed by the instruction execution system. In the context of the present disclosure, a “computer-readable medium” can be any medium that can contain, store, or maintain the logic or application described herein for use by or in connection with the instruction execution system. The computer-readable medium can comprise any one of many physical media such as magnetic, optical, or semiconductor media. More specific examples of suitable computer-readable media include, but are not limited to, magnetic tapes, magnetic floppy diskettes, magnetic hard drives, memory cards, solid-state drives, USB flash drives, or optical discs. Also, the computer-readable medium may be a random access memory (RAM) including, for example, static random access memory (SRAM) and dynamic random access memory (DRAM), or magnetic random access memory (MRAM). In addition, the computer-readable medium may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other type of memory device.

It should be emphasized that the above-described embodiments of the present disclosure are merely possible examples of implementations set forth for a clear understanding of the principles of the disclosure. Many variations and modifications may be made to the above-described example(s) without departing substantially from the spirit and principles of the disclosure. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

What is claimed is:
 1. A method comprising: receiving a product label comprising an image of a plurality of words; generating, using optical character recognition (OCR), an image of a first word of the plurality of words; generating a latent representation of the first word using a variational autoencoder; determining, by a classifier based on the latent representation, a score indicating a likelihood that the image of the first word represents bolded, italicized, or highlighted text relative to at least some other words of the plurality of words; training the variational autoencoder and the classifier using back propagation of reconstruction loss and classification loss; determining that the first word represents an allergen by comparing the first word to a list of allergens; determining, based at least in part on the score, that the first word is not bolded; and preventing a product associated with the product label from being listed on an e-commerce site based at least in part on the first word not being bolded.
 2. The method of claim 1, further comprising: obtaining a set of scores for the plurality of words from the classifier, wherein each score represents a respective likelihood that the image of the respective word represents bold, italicized, or highlighted text relative to the some other words of the plurality of words; determining a mean and standard deviation of the set of scores; determining a subset of scores that are statistical outliers relative to other members of the set of scores; and generating output data indicating that images of words associated with the subset of scores are bold, italicized, or highlighted.
 3. A method comprising: receiving image data representing a plurality of words; determining an image of a first word of the plurality of words; generating, by a first machine learning model, a first feature representation of the first word, wherein the first machine learning model comprises a variational autoencoder or a transformer; generating, by a classifier network, a first score based at least in part on the first feature representation, wherein the classifier network is trained to determine the first score representing a likelihood that an appearance of the first word in the image of the first word has a different style attribute from at least some other words of the plurality of words based at least in part on the first feature representation of the first word, wherein the different style attribute is one or more of the following: bold, italics, or highlighting; determining, based at least in part on the first score, that the first word is not bolded, italicized, or highlighted; and generating output data indicating that a first product associated with the image data cannot be listed on an e-commerce site.
 4. The method of claim 3, further comprising: generating, by the first machine learning model, a conditional distribution representing the image of the first word; determining a mean of the conditional distribution; inputting the mean of the conditional distribution into the classifier network; and outputting the first score from the classifier network.
 5. The method of claim 3, further comprising: generating a set of respective scores for each word of the plurality of words; determining a mean score and standard deviation for the set; and determining a subset of scores that are outliers with respect to the other scores of the set based at least in part on the mean score and standard deviation.
 6. The method of claim 5, further comprising: determining a set of words of the plurality of words that are associated with the subset of scores; and generating second output data indicating that an appearance of each word of the set of words is emphasized in the plurality of words.
 7. The method of claim 3, further comprising: generating, by the first machine learning model, the first feature representation comprising a latent distribution representing the image of the first word; sampling a point from the latent distribution; generating, by a decoder network, reconstruction data representing the image of the first word; and determining a reconstruction loss representing a difference between the reconstruction data and the image of the first word.
 8. The method of claim 7, wherein the first machine learning model comprises the variational autoencoder, the method further comprising training the variational autoencoder based at least in part on the reconstruction loss.
 9. The method of claim 3, wherein the first machine learning model comprises the variational autoencoder, the method further comprising: generating the first feature representation using the variational autoencoder; and training the variational autoencoder and the classifier network together based at least in part on reconstruction loss and classifier loss.
 10. The method of claim 3, wherein the first machine learning model comprises the variational autoencoder, the method further comprising: sending first data from an intermediate layer of an encoder of the variational autoencoder to a layer of a decoder of the variational autoencoder, wherein the first feature representation is generated based at least in part on a reconstruction loss determined using the first data.
 11. The method of claim 3, further comprising: generating a second feature representation of the first word, the second feature representation representing a relationship between a first height of the first word, in pixels, and a first length of the first word, in pixels; and sending the second feature representation to the classifier network, wherein the first score is further generated based at least in part on the second feature representation.
 12. A system comprising: at least one processor; and non-transitory computer-readable memory storing instructions that, when executed by the at least one processor, are effective to: receive image data representing a plurality of words; determine an image of a first word of the plurality of words; generate, by a first machine learning model, a first feature representation of the first word, wherein the first machine learning model comprises a variational autoencoder or a transformer; generate, by a classifier network, a first score based at least in part on the first feature representation, wherein the classifier network is trained to determine the first score representing a likelihood that an appearance of the first word in the image of the first word has a different style attribute from at least some other words of the plurality of words based at least in part on the first feature representation of the first word, wherein the different style attribute is one or more of the following: bold, italics, or highlighting; determine, based at least in part on the first score, that the first word is not bolded, italicized, or highlighted; and generate output data indicating that a first product associated with the image data cannot be listed on an e-commerce site.
 13. The system of claim 12, the non-transitory computer-readable memory storing further instructions that, when executed by the at least one processor, are further effective to: generate, by the first machine learning model, a conditional distribution representing the image of the first word; determine a mean of the conditional distribution; input the mean of the conditional distribution into the classifier network; and output the first score from the classifier network.
 14. The system of claim 12, the non-transitory computer-readable memory storing further instructions that, when executed by the at least one processor, are further effective to: generate a set of respective scores for each word of the plurality of words; determine a mean score and standard deviation for the set; and determine a subset of scores that are outliers with respect to the other scores of the set based at least in part on the mean score and standard deviation.
 15. The system of claim 14, the non-transitory computer-readable memory storing further instructions that, when executed by the at least one processor, are further effective to: determine a set of words of the plurality of words that are associated with the subset of scores; and generate second output data indicating that an appearance of each word of the set of words is emphasized in the plurality of words.
 16. The system of claim 12, the non-transitory computer-readable memory storing further instructions that, when executed by the at least one processor, are further effective to: generate, by the first machine learning model, the first feature representation comprising a latent distribution representing the image of the first word; sample a point from the latent distribution; generate, by a decoder network, reconstruction data representing the image of the first word; and determine a reconstruction loss representing a difference between the reconstruction data and the image of the first word.
 17. The system of claim 16, wherein the first machine learning model comprises the variational autoencoder, and wherein the non-transitory computer-readable memory stores further instructions that, when executed by the at least one processor, are further effective to train the variational autoencoder based at least in part on the reconstruction loss.
 18. The system of claim 12, the non-transitory computer-readable memory storing further instructions that, when executed by the at least one processor, are further effective to: determine that the image of the first word is anomalous with respect to the at least some other words of the plurality of words based at least in part on the first score; determine ground truth data indicating whether the image of the first word is anomalous; and train the classifier network using a training instance comprising the first feature representation and the ground truth data, wherein the classifier network receives the first feature representation of the first word as an input and outputs the first score.